feat(server): add target-layer-split backend adapter path #265
2 issues found across 27 files
5 issues found across 32 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/backend_factory.cpp">
<violation number="1" location="server/src/common/backend_factory.cpp:55">
P2: Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path</violation>
</file>
<file name="server/test/test_server_unit.cpp">
<violation number="1" location="server/test/test_server_unit.cpp:1366">
P2: Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.</violation>
</file>
cfg.device = args.device;
cfg.draft_gpu = args.draft_device.gpu;
cfg.remote_draft = args.remote_draft;
cfg.fa_window = args.fa_window;
P2: Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/backend_factory.cpp, line 55:
<comment>Layer-split qwen35 path silently drops multiple runtime/decode options present in non-split path</comment>
<file context>
@@ -42,6 +45,30 @@ std::unique_ptr<ModelBackend> create_backend(const BackendArgs & args) {
+ cfg.device = args.device;
+ cfg.draft_gpu = args.draft_device.gpu;
+ cfg.remote_draft = args.remote_draft;
+ cfg.fa_window = args.fa_window;
+ cfg.kq_stride_pad = args.kq_stride_pad;
+ cfg.draft_ctx_max = args.draft_ctx_max;
</file context>
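One way to keep the split and non-split paths from drifting apart is to copy the shared options through a single helper that both paths call. The sketch below is an illustration only: `apply_shared_options`, `SingleDeviceConfig`, and `LayerSplitConfig` are hypothetical names, the `int` field types are assumptions, and only the field names `fa_window`, `kq_stride_pad`, `draft_ctx_max`, and `draft_swa_window` come from this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the real BackendArgs; field types are assumptions.
struct BackendArgs {
    int fa_window = 0;
    int kq_stride_pad = 0;
    int draft_ctx_max = 0;
    int draft_swa_window = 0;
};

// Hypothetical config for the non-split path.
struct SingleDeviceConfig {
    int fa_window = 0;
    int kq_stride_pad = 0;
    int draft_ctx_max = 0;
    int draft_swa_window = 0;
};

// Hypothetical config for the layer-split path.
struct LayerSplitConfig {
    int fa_window = 0;
    int kq_stride_pad = 0;
    int draft_ctx_max = 0;
    int draft_swa_window = 0;
};

// Copies every runtime/decode option that both paths share. A new option
// added here reaches both configs, and a config that lacks the field fails
// to compile instead of silently dropping the setting.
template <typename Config>
void apply_shared_options(Config & cfg, const BackendArgs & args) {
    cfg.fa_window        = args.fa_window;
    cfg.kq_stride_pad    = args.kq_stride_pad;
    cfg.draft_ctx_max    = args.draft_ctx_max;
    cfg.draft_swa_window = args.draft_swa_window;
}
```

With a helper like this, the factory would call `apply_shared_options` once per branch and keep only the path-specific assignments (device lists, shard plans) inline.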
GenerateResult restored = backend.restore_and_generate(2, restore_req, io);

TEST_ASSERT(restored.ok);
TEST_ASSERT(raw->dflash_called);
P2: Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/test/test_server_unit.cpp, line 1366:
<comment>Restore-path test expects DFlash on snapshot restore, but the intended contract is AR fallback until shard replay exists.</comment>
<file context>
@@ -1184,6 +1188,211 @@ static void test_normalize_responses_tool_followup_messages() {
+ GenerateResult restored = backend.restore_and_generate(2, restore_req, io);
+
+ TEST_ASSERT(restored.ok);
+ TEST_ASSERT(raw->dflash_called);
+ TEST_ASSERT(raw->restored_slot == 2);
+ TEST_ASSERT(!raw->reset_called);
</file context>
Suggested change:
- TEST_ASSERT(raw->dflash_called);
+ TEST_ASSERT(!raw->dflash_called);
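If the reviewer's reading of the contract is right, the restore path should pick autoregressive (AR) decoding whenever a slot was restored from a snapshot and shard replay is not yet available. A minimal sketch of that decision follows; `RestoreContext`, `DecodePath`, and `choose_decode_path` are hypothetical names for illustration and do not appear in the PR.

```cpp
#include <cassert>

enum class DecodePath { Autoregressive, DFlash };

// Hypothetical inputs to the decision; not the server's real request type.
struct RestoreContext {
    bool restored_from_snapshot = false;  // slot state came from a snapshot
    bool shard_replay_available = false;  // per-shard replay is implemented
    bool dflash_requested = false;        // caller asked for speculative decode
};

// Assumed contract: fall back to AR after a snapshot restore until shard
// replay exists; otherwise honor the DFlash request.
DecodePath choose_decode_path(const RestoreContext & ctx) {
    if (!ctx.dflash_requested) {
        return DecodePath::Autoregressive;
    }
    if (ctx.restored_from_snapshot && !ctx.shard_replay_available) {
        return DecodePath::Autoregressive;
    }
    return DecodePath::DFlash;
}
```

Under this assumed policy, the unit test for the restore path would assert that the DFlash hook was not called, which matches the suggested change.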
Record the clean integration of PRs Luce-Org#265 and Luce-Org#273, the refreshed conflict probes for the remaining selective-port PRs, and the current validation results.
Summary
This PR exposes target layer split through the native C++ server and makes the layer-split adapter path reusable instead of qwen35-owned.
Before this change, the repository already had qwen35 target-layer split machinery in the bench / daemon path, but `dflash_server` still rejected `--target-devices` and `--target-layer-split`. The split implementation also carried some generic concepts inside qwen35-specific code, which would make Gemma4 and future model adapters repeat the same load-plan, shard metadata, peer-access, and snapshot setup logic.
This PR moves the shared target-layer-split flow into a generic server-facing backend and common layer-split scaffold. qwen35 becomes the first concrete adapter on top of that scaffold.
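The scaffold described above can be pictured as a generic backend that owns the load plan and delegates model-specific work to an adapter. The following is a minimal sketch under stated assumptions, not the PR's actual code: the type names `LayerSplitRange`, `LayerSplitAdapter`, and `LayerSplitBackend` come from this PR, but every member, the even-split policy in `plan_even_split`, and `ToyAdapter` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Assumed shape: one contiguous block of layers assigned to one device.
struct LayerSplitRange {
    int device = 0;       // GPU index that owns this shard
    int first_layer = 0;  // inclusive
    int last_layer = 0;   // exclusive
};

// Model-specific hooks; the real adapter interface is likely much larger.
class LayerSplitAdapter {
public:
    virtual ~LayerSplitAdapter() = default;
    virtual int n_layers() const = 0;
    virtual bool load_shard(const LayerSplitRange & range) = 0;
};

// Hypothetical policy: split layers evenly, earlier devices take the remainder.
std::vector<LayerSplitRange> plan_even_split(int n_layers, const std::vector<int> & devices) {
    std::vector<LayerSplitRange> plan;
    if (n_layers <= 0 || devices.empty()) {
        return plan;
    }
    const int n_dev = static_cast<int>(devices.size());
    const int base = n_layers / n_dev;
    const int extra = n_layers % n_dev;
    int next = 0;
    for (int i = 0; i < n_dev; ++i) {
        const int count = base + (i < extra ? 1 : 0);
        LayerSplitRange r;
        r.device = devices[static_cast<std::size_t>(i)];
        r.first_layer = next;
        r.last_layer = next + count;
        plan.push_back(r);
        next += count;
    }
    return plan;
}

// Generic flow: build the plan once, then ask the adapter to load each shard.
class LayerSplitBackend {
public:
    explicit LayerSplitBackend(std::unique_ptr<LayerSplitAdapter> adapter)
        : adapter_(std::move(adapter)) {}

    bool load(const std::vector<int> & devices) {
        plan_ = plan_even_split(adapter_->n_layers(), devices);
        if (plan_.empty()) {
            return false;
        }
        for (const LayerSplitRange & r : plan_) {
            if (!adapter_->load_shard(r)) {
                return false;
            }
        }
        return true;
    }

    const std::vector<LayerSplitRange> & plan() const { return plan_; }

private:
    std::unique_ptr<LayerSplitAdapter> adapter_;
    std::vector<LayerSplitRange> plan_;
};

// Toy adapter standing in for a model family; it only counts loaded layers.
class ToyAdapter : public LayerSplitAdapter {
public:
    int n_layers() const override { return 10; }
    bool load_shard(const LayerSplitRange & range) override {
        loaded += range.last_layer - range.first_layer;
        return true;
    }
    int loaded = 0;
};
```

The point of the split is that the planning and iteration live in one place, so a second model family would only need its own adapter subclass.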
Changes
- Adds `LayerSplitBackend` for the shared target-sharding request flow used by `dflash_server`.
- Adds `LayerSplitAdapter`, so each model family supplies only its model-specific partial load, cache, forward, compression, and optional spec-decode behavior.
- Adds `LayerSplitRange`, `LayerSplitLoadPlan`, and `LayerSplitShardMeta`.
- […] `TargetLoadPlan` / `TargetLayerSplitShard` concepts from the shared flow.
- Adds `Qwen35LayerSplitAdapter` and routes qwen35 multi-GPU placement through `LayerSplitBackend(Qwen35LayerSplitAdapter)`.
- […] `Qwen35Backend`.
- […] `Qwen35LayerSplitShard` and `run_qwen35_layer_split_forward`.
- Rejects `--kv-cache-dir` with `--target-devices` instead of silently clearing the disk-cache setting. Disk prefix cache still needs a sharded snapshot format before it can safely support split targets.
- […] `keep_ratio` before passing the request into model adapters.
- Passes `draft_swa_window` into the qwen35 layer-split adapter so split and non-split qwen35 paths stay aligned for that runtime setting.
- Makes `LayerSplitBackend::shutdown()` idempotent and clears qwen35 shard state during teardown.
Notes
- […] `LayerSplitBackend` by defining only their model-specific shard payload, partial loader, cache, forward, snapshot, and optional DFlash/PFlash hooks.
- `test_server_unit` passes on both builds.
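Two of the behaviors listed under Changes, rejecting `--kv-cache-dir` together with `--target-devices` and making shutdown idempotent, can be sketched as follows. This is an illustration under assumptions: `ServerArgs`, `validate_args`, and `SplitBackendSketch` are hypothetical names, the error text is invented, and the sketch assumes the rejection applies whenever any target device is given.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical parsed-argument bundle; only the two flag names come from the PR.
struct ServerArgs {
    std::vector<int> target_devices;  // from --target-devices
    std::string kv_cache_dir;         // from --kv-cache-dir
};

// Returns an empty string when the arguments are acceptable, otherwise an
// error message. Failing loudly replaces silently clearing the cache dir.
std::string validate_args(const ServerArgs & args) {
    if (!args.target_devices.empty() && !args.kv_cache_dir.empty()) {
        return "--kv-cache-dir is not supported together with --target-devices";
    }
    return "";
}

// Idempotent teardown: a second shutdown() call is a no-op.
class SplitBackendSketch {
public:
    void load(std::size_t n_shards) {
        shards_.assign(n_shards, 1);  // placeholder for real shard state
        live_ = true;
    }

    void shutdown() {
        if (!live_) {
            return;  // already torn down, nothing to release
        }
        shards_.clear();
        live_ = false;
        ++teardown_count_;
    }

    ~SplitBackendSketch() { shutdown(); }

    std::size_t shard_count() const { return shards_.size(); }
    int teardown_count() const { return teardown_count_; }

private:
    std::vector<int> shards_;
    bool live_ = false;
    int teardown_count_ = 0;
};
```

Calling `shutdown()` from the destructor is safe here precisely because of the `live_` guard, so explicit shutdown followed by destruction releases the shard state only once.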